Learning Aligned Cross-Modal Representations from Weakly Aligned Data
People can recognize scenes across many different modalities beyond natural
images. In this paper, we investigate how to learn cross-modal scene
representations that transfer across modalities. To study this problem, we
introduce a new cross-modal scene dataset. While convolutional neural networks
can categorize cross-modal scenes well, they also learn an intermediate
representation not aligned across modalities, which is undesirable for
cross-modal transfer applications. We present methods to regularize cross-modal
convolutional neural networks so that they have a shared representation that is
agnostic of the modality. Our experiments suggest that our scene representation
can help transfer representations across modalities for retrieval. Moreover,
our visualizations suggest that units emerge in the shared representation that
tend to activate on consistent concepts independently of the modality.
Comment: Conference paper at CVPR 2016
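To make the regularization idea concrete, here is a minimal sketch, assuming modality-specific lower layers that feed a shared upper network, plus a simple statistical penalty that aligns shared-layer activations across modalities. The names (CrossModalNet, alignment_loss) and the particular penalty are illustrative assumptions, not the authors' code; the paper's exact regularizers may differ.

```python
# Illustrative sketch (not the authors' code): modality-specific encoders
# feed a shared head, and a statistical penalty pushes the shared
# activations of different modalities toward the same distribution.
import torch
import torch.nn as nn

class CrossModalNet(nn.Module):
    def __init__(self, encoders: dict, shared_dim: int = 512, num_classes: int = 205):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # one encoder per modality
        self.shared = nn.Sequential(nn.Linear(shared_dim, shared_dim), nn.ReLU())
        self.classifier = nn.Linear(shared_dim, num_classes)

    def forward(self, x, modality: str):
        h = self.encoders[modality](x)  # modality-specific features
        z = self.shared(h)              # shared, modality-agnostic representation
        return z, self.classifier(z)

def alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    # Penalize differences in mean shared-layer activation between two
    # modalities; a stand-in for stronger distribution-matching penalties.
    return (z_a.mean(dim=0) - z_b.mean(dim=0)).pow(2).sum()
```

During training, the total objective would combine the per-modality classification losses with a weighted alignment term, encouraging units in the shared layer to respond to the same concept regardless of modality.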
Face-to-BMI: Using Computer Vision to Infer Body Mass Index on Social Media
A person's weight status can have profound implications on their life,
ranging from mental health, to longevity, to financial income. At the societal
level, "fat shaming" and other forms of "sizeism" are a growing concern, while
increasing obesity rates are linked to ever-rising healthcare costs. For these
reasons, researchers from a variety of backgrounds are interested in studying
obesity from all angles. To obtain data, traditionally, a person would have to
accurately self-report their body-mass index (BMI) or would have to see a
doctor to have it measured. In this paper, we show how computer vision can be
used to infer a person's BMI from social media images. We hope that our tool,
which we release, helps to advance the study of social aspects related to body
weight.
Comment: This is a preprint of a short paper accepted at ICWSM'17. Please cite
that version instead.
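As a rough sketch of this kind of pipeline, under the assumption that deep face embeddings feed a support-vector regressor (synthetic arrays stand in for real features and labels; the paper's exact feature extractor and regressor may differ):

```python
# Hedged sketch: deep face features -> BMI regression. Synthetic data only.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

# In practice, X would hold face embeddings extracted from profile pictures
# with a pretrained face CNN, and y the matching ground-truth BMI values.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 128))            # (n_samples, feature_dim)
y = 25 + 3 * X[:, 0] + rng.normal(0, 1, 500)   # fake BMI labels for the demo

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
reg = SVR(kernel="linear", epsilon=1.0)        # regression on top of embeddings
reg.fit(X_train, y_train)
print("Predicted BMI for first test faces:", reg.predict(X_test[:3]).round(1))
```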
Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
Recent works have shown that large models pretrained on common visual
learning tasks can provide useful representations for a wide range of
specialized perception problems, as well as a variety of robotic manipulation
tasks. While prior work on robotic manipulation has predominantly used frozen
pretrained features, we demonstrate that in robotics this approach can fail to
reach optimal performance, and that fine-tuning of the full model can lead to
significantly better results. Unfortunately, fine-tuning disrupts the
pretrained visual representation, causing representational drift towards the
fine-tuned task and thus a loss of the original model's versatility. We
introduce "lossless adaptation" to address this shortcoming of
classical fine-tuning. We demonstrate that appropriate placement of our
parameter efficient adapters can significantly reduce the performance gap
between frozen pretrained representations and full end-to-end fine-tuning
while leaving the original representation unchanged, thus preserving the
capabilities of the pretrained model. We perform a comprehensive investigation
across three major model architectures (ViTs, NFNets, and ResNets), supervised
(ImageNet-1K classification) and self-supervised pretrained weights (CLIP,
BYOL, Visual MAE) across three task domains and 35 individual tasks, and
demonstrate that our claims hold strongly across a variety of settings.
Comment: ICLR'23; project page: https://sites.google.com/view/robo-adapters
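One common way to obtain this representation-preserving behavior is a residual bottleneck adapter whose output projection is zero-initialized, so the adapted network initially computes exactly the frozen model's features. The sketch below illustrates that idea; the module names and placement are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch of a parameter-efficient adapter around a frozen block.
# Zero-initializing the output projection makes the adapted model start out
# identical to the frozen model ("lossless" at initialization).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # branch contributes nothing at init
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

class AdaptedBlock(nn.Module):
    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False     # backbone stays frozen
        self.adapter = Adapter(dim)     # only these weights are trained

    def forward(self, x):
        return self.adapter(self.block(x))
```

Because the backbone weights never change, removing the adapters recovers the original pretrained model exactly, which is what distinguishes this from classical fine-tuning.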
TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
We present a novel model for Tracking Any Point (TAP) that effectively tracks
any queried point on any physical surface throughout a video sequence. Our
approach employs two stages: (1) a matching stage, which independently locates
a suitable candidate point match for the query point on every other frame, and
(2) a refinement stage, which updates both the trajectory and query features
based on local correlations. The resulting model surpasses all baseline methods
by a significant margin on the TAP-Vid benchmark, as demonstrated by an
approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model
facilitates fast inference on long, high-resolution video sequences. On a
modern GPU, our implementation can track points faster than real time, and
extends flexibly to higher-resolution videos. Given the
high-quality trajectories extracted from a large dataset, we demonstrate a
proof-of-concept diffusion model which generates trajectories from static
images, enabling plausible animations. Visualizations, source code, and
pretrained models can be found on our project webpage.
Comment: Published at ICCV 2023
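The two-stage structure can be illustrated with a toy, runnable example: stage 1 initializes the track by globally correlating the query feature against every frame independently, and stage 2 refines the trajectory using temporal evidence. Here a simple smoothing pass stands in for TAPIR's learned refinement over local correlations; this is a didactic sketch, not the paper's architecture.

```python
# Toy illustration of the two-stage TAP idea on random "features".
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 8, 32, 32, 16
feats = rng.standard_normal((T, H, W, C))  # per-frame feature maps
query_feat = feats[0, 5, 7]                # feature at the queried point

# Stage 1: per-frame initialization via global correlation with the query.
corr = np.einsum("thwc,c->thw", feats, query_feat)
init = [np.unravel_index(corr[t].argmax(), (H, W)) for t in range(T)]

# Stage 2: temporal refinement; simple trajectory smoothing stands in for
# the learned iterative updates over local correlations.
track = np.array(init, dtype=float)
for _ in range(10):
    track[1:-1] = 0.5 * track[1:-1] + 0.25 * (track[:-2] + track[2:])
print(track)  # (T, 2) array of refined point locations
```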